A computational system for lattice QCD with overlap Dirac quarks*

Ting-Wai Chiu^a, Tung-Han Hsieh^a, Chao-Hsi Huang^a, Tsung-Ren Huang^a
^a Physics Department, National Taiwan University, Taipei, Taiwan 106, Taiwan.

We outline the essential features of a Linux PC cluster which is now being developed at National Taiwan
University, and discuss how to optimize its hardware and software for lattice QCD with overlap Dirac quarks.
At present, the cluster consists of 30 nodes, each with one Pentium 4 processor (1.6/2.0
GHz), one Gbyte of PC800 RDRAM, one 40/80 Gbyte hard disk, and a network card. The speed of this system is
estimated to be 30 Gflops, and its price/performance ratio is better than $1.0/Mflops for 64-bit (double precision)
computations in quenched lattice QCD with overlap Dirac quarks.



1. Introduction 

It is well known that extracting physics from lattice QCD requires computing power exceeding
that of any desktop personal computer currently on the market. Therefore, for anyone without
supercomputer resources, building a computational system seems inevitable if one really wishes
to obtain meaningful values of physical quantities from lattice QCD. However, the feasibility
of such a project depends not only on funding, but also on the theoretical advancement of the
subject, namely, the realization of exact chiral symmetry on the lattice [?]. Now, if we also
take into account the current price/performance of PC hardware components, it seems to be the
right time to rejuvenate the project [?] with a new goal: to build a computational system for
lattice QCD with exact chiral symmetry. In this paper, we outline the essential features of a
Linux PC cluster which is now being developed at National Taiwan University, and discuss how to
optimize its hardware and software for lattice QCD with overlap Dirac quarks. More detailed
descriptions have been given in Ref. [?].

First, we start from quenched QCD, and compute quark propagators in the gluon field
background, for a sequence of configurations generated stochastically with weight
$\exp(-A_g)$. Then hadronic observables such as meson and baryon correlation functions
can be constructed, from which the hadron masses and decay constants can be extracted.
In general, one requires that any quark propagator coupling to physical hadrons must be
of the form [?]

$$ (D_c + m_q)^{-1} , \tag{1} $$

where $m_q$ is the bare quark mass, and $D_c$ is a chirally symmetric and anti-Hermitian
Dirac operator (see footnote 2). For any massless lattice Dirac operator $D$ satisfying
the Ginsparg-Wilson relation [?]

$$ D \gamma_5 + \gamma_5 D = 2 r a D \gamma_5 D , \tag{2} $$

it can be written as $D = D_c (1 + r a D_c)^{-1}$ [?], and the bare quark mass is
naturally added to the $D_c$ in the numerator [?],

$$ D(m_q) = (D_c + m_q)(1 + r a D_c)^{-1} . $$

Then the quenched quark propagator becomes

$$ (D_c + m_q)^{-1} = (1 - r m_q a)^{-1} \left[ D(m_q)^{-1} - r a \right] . $$
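The relation between $(D_c + m_q)^{-1}$ and $D(m_q)^{-1}$ is purely algebraic, so it can be checked numerically on a small matrix. The following is a minimal sketch in Python with NumPy, assuming a random anti-Hermitian matrix as a stand-in for $D_c$ (it is not a lattice Dirac operator); the function name and the values of $r$, $a$ and $m_q$ are illustrative choices of ours.

```python
import numpy as np

def propagator_identity_residual(n=8, r=0.5, a=1.0, m_q=0.1, seed=0):
    """Relative mismatch between the two sides of the propagator relation."""
    rng = np.random.default_rng(seed)
    # Random anti-Hermitian matrix standing in for D_c (illustration only).
    M = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    D_c = M - M.conj().T
    identity = np.eye(n)
    # D(m_q) = (D_c + m_q) (1 + r a D_c)^{-1}
    D_m = (D_c + m_q * identity) @ np.linalg.inv(identity + r * a * D_c)
    lhs = np.linalg.inv(D_c + m_q * identity)
    # (1 - r m_q a)^{-1} [ D(m_q)^{-1} - r a ]
    rhs = (np.linalg.inv(D_m) - r * a * identity) / (1.0 - r * m_q * a)
    return np.linalg.norm(lhs - rhs) / np.linalg.norm(lhs)

print(propagator_identity_residual())
```

The residual should be at the level of floating-point rounding, which reflects the identity $D(m_q)^{-1} = (1 - r m_q a)(D_c + m_q)^{-1} + r a$.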

If we fix one of the end points at $(\vec{0}, 0)$ and use the Hermiticity
$D^\dagger = \gamma_5 D \gamma_5$, then only 12 (3 colors times 4 Dirac indices) columns of

$$ D(m_q)^{-1} = D^\dagger(m_q) \left\{ D(m_q) D^\dagger(m_q) \right\}^{-1} \tag{3} $$

are needed for computing the time correlation functions of hadrons. Now our problem is how
to optimize a PC cluster to compute $D(m_q)^{-1}$ for a set of bare quark masses.

Footnote 2: Here we assume that $D_c$ is doubler-free and has correct continuum behavior,
and that $D = D_c (1 + r a D_c)^{-1}$ is exponentially local for a range of $r > 0$.

For overlap Dirac quarks, we need to solve the following linear system by (multi-mass)
conjugate gradient (CG),

$$ D D^\dagger Y = \left( \alpha + \eta \, P_\pm H_w \frac{1}{\sqrt{H_w^2}} \right) Y = \mathbb{1} , \tag{4} $$

where $P_\pm = (\gamma_5 \pm 1)/2$, $\alpha = 2 m_0^2 + m_q^2/2$,
$\eta = 2 m_0^2 - m_q^2/2$, and $m_0$ is fixed to 1.3 in our
computations. Then the quark propagators can be obtained via (3). With Zolotarev optimal
rational approximation [?] to $(H_w^2)^{-1/2}$, the multiplication
($h_w = H_w / \lambda_{\min}$)



$$ H_w \frac{1}{\sqrt{H_w^2}} Y \simeq h_w (h_w^2 + c_{2n}) \sum_{l=1}^{n} \frac{b_l}{h_w^2 + c_{2l-1}} Y \tag{5} $$



can be evaluated by invoking another (multi-mass) conjugate gradient process for the
linear systems

$$ (h_w^2 + c_{2l-1}) Z_l = Y , \qquad l = 1, \cdots, n , \tag{6} $$

where the coefficients $c_l$, $b_l$ and $d_0$ have been given explicitly in Ref. [?].
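The reduction of the multiplication (5) to the shifted linear systems (6) can be illustrated on a small matrix. The sketch below, in Python with NumPy, assumes placeholder values for the coefficients (they are not the Zolotarev coefficients, which are given in the cited reference), a small real symmetric matrix in place of $h_w$, and direct solves in place of the multi-mass CG; it only checks that summing the solutions $Z_l$ reproduces the rational function of the matrix.

```python
import numpy as np

def rational_times_vector(h, Y, c_odd, c_2n, b):
    """Return h (h^2 + c_2n) sum_l b_l Z_l, where (h^2 + c_odd[l]) Z_l = Y."""
    identity = np.eye(h.shape[0])
    h2 = h @ h
    acc = np.zeros_like(Y)
    for c, b_l in zip(c_odd, b):
        # Eq. (6): one shifted system per pole; a direct solve replaces the inner CG.
        Z_l = np.linalg.solve(h2 + c * identity, Y)
        acc = acc + b_l * Z_l
    return h @ (h2 @ acc + c_2n * acc)

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
h = (A + A.T) / 2                      # small real symmetric stand-in for h_w
Y = rng.standard_normal(6)
# Placeholder coefficients for illustration only, NOT Zolotarev values.
c_odd, c_2n, b = [0.5, 2.0, 7.0], 4.0, [0.3, 0.2, 0.1]

via_solves = rational_times_vector(h, Y, c_odd, c_2n, b)

# Reference: apply the same scalar rational function to the eigenvalues of h.
lam, U = np.linalg.eigh(h)
f = lam * (lam**2 + c_2n) * sum(b_l / (lam**2 + c) for c, b_l in zip(c_odd, b))
reduction_error = np.linalg.norm(via_solves - U @ (f * (U.T @ Y)))
print(reduction_error)
```

Because all the systems in (6) share the same matrix up to a constant shift, a multi-mass CG can in principle treat them with a single sequence of matrix-vector products, which is why the text uses it in place of $n$ independent solves.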

In order to improve the accuracy of the rational approximation as well as to reduce the
number of iterations in the inner CG loop, it is crucial to project out the largest and
some low-lying eigenmodes of $H_w^2$. Denoting these eigenmodes by

$$ H_w u_j = \lambda_j u_j , \qquad j = 1, \cdots, k , \tag{7} $$

we project the linear systems (6) onto the complement of the vector space spanned by
these eigenmodes,

$$ (h_w^2 + c_{2l-1}) Z_l = \bar{Y} \equiv \Big( 1 - \sum_{j=1}^{k} u_j u_j^\dagger \Big) Y . \tag{8} $$

Thus the r.h.s. of (5) can be rewritten as

$$ h_w (h_w^2 + c_{2n}) \sum_{l=1}^{n} b_l Z_l + \sum_{j=1}^{k} \frac{\lambda_j}{\sqrt{\lambda_j^2}} u_j u_j^\dagger Y \equiv S . \tag{9} $$



Then the breaking of exact chiral symmetry can be measured in terms of

$$ \sigma = \frac{ \left| S^\dagger S - Y^\dagger Y \right| }{ Y^\dagger Y } . \tag{10} $$

In practice, one has no difficulty attaining $\sigma < 10^{-12}$ for most gauge
configurations on a finite lattice (Table 1).
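The bookkeeping of the projection and of the measure $\sigma$ can be sketched as follows, in Python with NumPy. This is an illustration under stated assumptions, not the production code: a small random Hermitian matrix stands in for $H_w$, a full diagonalization replaces the eigensolver, and the exact sign function on the complement replaces the Zolotarev rational approximation, so $\sigma$ comes out at rounding level. All function names are ours.

```python
import numpy as np

def sign_with_projection(H, Y, k_low=2):
    """Apply H / sqrt(H^2) to Y, treating the projected eigenmodes exactly."""
    lam, U = np.linalg.eigh(H)
    order = np.argsort(np.abs(lam))
    picked = np.concatenate([order[:k_low], order[-1:]])  # low-lying modes + largest mode
    rest = order[k_low:-1]
    u, lam_p = U[:, picked], lam[picked]
    coeff = u.conj().T @ Y
    Y_bar = Y - u @ coeff                                  # projected source, as in Eq. (8)
    # Stand-in for the rational approximation acting on the complement.
    U_r = U[:, rest]
    S_rest = U_r @ (np.sign(lam[rest]) * (U_r.conj().T @ Y_bar))
    # Contribution of the projected eigenmodes, as in the second sum of Eq. (9).
    S_modes = u @ ((lam_p / np.sqrt(lam_p**2)) * coeff)
    return S_rest + S_modes

def chiral_symmetry_violation(S, Y):
    """sigma = |S^dagger S - Y^dagger Y| / (Y^dagger Y), as in Eq. (10)."""
    yy = np.vdot(Y, Y).real
    return abs(np.vdot(S, S).real - yy) / yy

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 10)) + 1j * rng.standard_normal((10, 10))
H = (A + A.conj().T) / 2               # small Hermitian stand-in for H_w
Y = rng.standard_normal(10) + 1j * rng.standard_normal(10)
S = sign_with_projection(H, Y)
sigma = chiral_symmetry_violation(S, Y)
print(sigma)
```

Since $(H_w / \sqrt{H_w^2})^2 = 1$, an exact evaluation gives $S^\dagger S = Y^\dagger Y$; any deviation measured by $\sigma$ therefore reflects the error of the approximation used on the complement.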

Now the computation of the overlap Dirac quark propagator involves two nested conjugate
gradient loops: the so-called inner CG loop (6), and the outer CG loop (4).
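The structure of the outer loop can be pictured with a textbook CG iteration on $D D^\dagger$, followed by the multiplication by $D^\dagger$ in Eq. (3). The following is a minimal sketch in Python with NumPy, assuming a small dense, well-conditioned random matrix in place of the overlap operator; the names `conjugate_gradient` and `propagator_column` are ours, and neither the multi-mass variant nor the inner loop is implemented.

```python
import numpy as np

def conjugate_gradient(apply_A, b, tol=1e-11, max_iter=1000):
    """Solve A x = b for Hermitian positive-definite A given as a function."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = np.vdot(r, r).real
    b_norm = np.sqrt(np.vdot(b, b).real)
    for iteration in range(max_iter):
        if np.sqrt(rr) <= tol * b_norm:
            return x, iteration
        Ap = apply_A(p)
        alpha = rr / np.vdot(p, Ap).real
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = np.vdot(r, r).real
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, max_iter

def propagator_column(D, col, tol=1e-11):
    """One column of D^{-1}, obtained as D^dagger (D D^dagger)^{-1} e_col."""
    source = np.zeros(D.shape[0], dtype=complex)
    source[col] = 1.0
    # In the nested scheme, each application of D D^dagger would itself
    # trigger an inner solve; here it is a plain dense matrix product.
    Y, iterations = conjugate_gradient(lambda v: D @ (D.conj().T @ v), source, tol)
    return D.conj().T @ Y, iterations

rng = np.random.default_rng(0)
n = 12
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
D = 2.0 * np.eye(n) + 0.2 * G / np.sqrt(n)   # toy matrix, not a Dirac operator
column, iterations = propagator_column(D, col=0)
e0 = np.zeros(n, dtype=complex)
e0[0] = 1.0
residual = np.linalg.norm(D @ column - e0)
print(iterations, residual)
```

Working with $D D^\dagger$ keeps the system Hermitian and positive definite, which is what CG requires, at the price of one extra multiplication by $D^\dagger$ at the end.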

2. Optimization 

Next we address the question of how to configure the hardware and software of a PC
cluster such that it attains the optimal price/performance ratio for the execution of
the nested CG loops.

The vital observation is that not all column vectors are used simultaneously at any step
of the nested CG loops, and that the computationally intensive part is the inner CG loop
(6). Thus we can use the hard disk as virtual memory for the storage of the intermediate
solution vectors and their conjugate gradient vectors at each iteration of the outer CG
loop, while the CPU is working on the inner CG loop. Then the amount of memory required
at each node can be greatly reduced. It is easy to figure out [?] the minimum memory
required for accommodating the link variables and all relevant vectors for the inner CG
loop,



$$ N_{\rm mem} = 384 \times N_s (n + 3) \ \text{bytes} , \tag{11} $$



where $n$ is the degree of the Zolotarev rational polynomial, and $N_s$ is the total
number of lattice sites. For computations of quark propagators on the $16^3 \times 32$
lattice with $n = 16$, (11) gives $N_{\rm mem} = 0.912$ Gbyte, which can be accommodated
in a single node with one Gbyte of memory (i.e., four 256 Mbyte memory modules). Then the
maximum speed of the cluster can be attained, since there is no communication overhead.
Moreover, the time for disk I/O (at the interface of the inner and outer CG loops)
constitutes only a few percent of the total time for the entire nested CG loops
(Table 1). Further, to take advantage of the vector unit of the Pentium 4, we rewrite the
computationally intensive part ($H_w$ times a vector $Y$) of our program in SSE2 code
[?], which yields a speed-up by a factor of $\sim 1.8$.
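The memory estimate is simple enough to evaluate directly. The short Python sketch below applies the formula $384 \times N_s (n + 3)$ bytes to the $16^3 \times 32$ lattice; the function name is ours. The unit convention is an inference on our part: the quoted 0.912 Gbyte is reproduced if one Mbyte is taken as $2^{20}$ bytes and one Gbyte as 1000 Mbyte.

```python
def min_memory_bytes(lattice_dims, n):
    """Minimum memory for the inner CG loop: 384 * N_s * (n + 3) bytes."""
    n_sites = 1
    for extent in lattice_dims:
        n_sites *= extent
    return 384 * n_sites * (n + 3)

bytes_needed = min_memory_bytes((16, 16, 16, 32), n=16)
mbytes = bytes_needed / 2**20   # assumes 1 Mbyte = 2**20 bytes
print(bytes_needed, mbytes)     # 956301312 912.0
```

With $N_s = 131072$ sites and $n + 3 = 19$, the requirement is 912 Mbyte, which leaves little headroom on a node with one Gbyte of memory and explains why the outer-loop vectors are kept on disk.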
Table 1

The execution time (in seconds) of a Pentium 4 (2 GHz) node to compute 12 columns of overlap
Dirac quark propagators, versus the size of the lattice. The parameters for the test are: the degree of
the Zolotarev rational polynomial is $n = 16$, the number of bare quark masses is $N_m = 16$ (with $m_q a > 0.02$),
the precision of each projected eigenmode satisfies $\| (H_w^2 - \lambda^2) |x\rangle \| < 10^{-13}$, and the stopping criterion
for the inner and outer CG loops is $\epsilon = 10^{-11}$.

Lattice     β     projection   λ_min/λ_max   χ sym.         inner CG      outer CG      disk I/O   Total
                  no./time                   σ (max.)       ave. iters.   tot. iters.   time       time
8^3 x 24    5.8   32/4725      0.198/6.207   5.3 x 10^-14   403           1282          -          63550
10^3 x 24   5.8   30/7803      0.152/6.204   6.4 x 10^-14   519           1943          -          218290
12^3 x 24   5.8   30/13258     0.129/6.211   9.8 x 10^-14   608           2840          -          625718
16^3 x 32   6.0   20/74937     0.215/6.260   3.3 x 10^-13   370           3968          66976      2095975

3. Conclusions 

The speed of our system of 30 nodes is higher than 30 Gflops, and the total cost of the
hardware is less than US$30,000. This amounts to a price/performance ratio better than
$1.0/Mflops for 64-bit (double precision) computations with overlap Dirac quarks. The
basic idea of the optimization is to let each node compute one of the 12 columns of the
quark propagators (for a set of bare quark masses), and to use the hard disk as virtual
memory for the vectors in the outer CG loop while the CPU is working on the inner CG
loop.

With the Zolotarev optimal rational approximation to $(H_w^2)^{-1/2}$, projection of high
and low-lying eigenmodes of $H_w^2$, the multi-mass CG algorithm, the SSE2 acceleration,
and the memory management, we are able to compute overlap Dirac quark propagators for 16
bare quark masses on the $16^3 \times 32$ lattice, with the precision of quark
propagators up to $10^{-11}$ and the precision of exact chiral symmetry up to
$10^{-12}$, at the rate of two gauge configurations ($\beta = 6.0$) per two days, with
our present system of 30 nodes (Table 1). This demonstrates that an optimized Linux PC
cluster can be a viable computational system for extracting physical quantities from
lattice QCD with overlap Dirac quarks [13].
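The quoted rate and the disk I/O overhead can be cross-checked against the $16^3 \times 32$ entries of Table 1 with a few lines of arithmetic. The Python sketch below assumes that the 12 columns take equal time on a 2 GHz node (an assumption of ours, since only the total is tabulated) and uses the bounds of US$30,000 and 30 Gflops quoted in the text.

```python
total_seconds_12_columns = 2_095_975   # Table 1, 16^3 x 32, one 2 GHz node
disk_io_seconds = 66_976               # Table 1, same row

# One column per node, assuming all 12 columns take equal time.
seconds_per_column = total_seconds_12_columns / 12
days_per_column = seconds_per_column / 86_400
disk_io_fraction = disk_io_seconds / total_seconds_12_columns

# Upper bound on cost and lower bound on speed, as quoted in the text.
cost_usd = 30_000
speed_mflops = 30 * 1000
usd_per_mflops = cost_usd / speed_mflops

print(round(days_per_column, 2), round(100 * disk_io_fraction, 1), usd_per_mflops)
```

Under these assumptions a column takes about two days on one node and disk I/O accounts for roughly 3% of the total time, in line with the statements that the rate is of the order of one configuration per day and that disk I/O costs only a few percent.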